Finance Data Project

In this data project we will focus on exploratory data analysis of stock prices. Keep in mind, this project is just meant to practice your visualization and pandas skills, it is not meant to be a robust financial analysis or be taken as financial advice.


NOTE: This project is extremely challenging because it will introduce a lot of new concepts and have you looking things up on your own (we'll point you in the right direction) to try to solve the tasks issued. Feel free to just go through the solutions lecture notebook and video as a "walkthrough" project if you don't want to have to look things up yourself. You'll still learn a lot that way!


We'll focus on bank stocks and see how they progressed throughout the financial crisis all the way to early 2016.

Get the Data

In this section we will learn how to use pandas to directly read data from Google finance using pandas!

First we need to start with the proper imports, which we've already laid out for you here.

Note: You'll need to install pandas-datareader for this to work! Pandas datareader allows you to read stock information directly from the internet Use these links for install guidance (pip install pandas-datareader), or just follow along with the video lecture.

The Imports

Already filled out for you.

In [1]:
! pip install pandas-datareader
Collecting pandas-datareader
  Downloading pandas_datareader-0.5.0-py2.py3-none-any.whl (74kB)
Requirement already satisfied: pandas>=0.17.0 in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from pandas-datareader)
Collecting requests-file (from pandas-datareader)
  Downloading requests-file-1.4.2.tar.gz
Requirement already satisfied: requests>=2.3.0 in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from pandas-datareader)
Collecting requests-ftp (from pandas-datareader)
  Downloading requests-ftp-0.3.1.tar.gz
Requirement already satisfied: python-dateutil>=2 in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from pandas>=0.17.0->pandas-datareader)
Requirement already satisfied: pytz>=2011k in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from pandas>=0.17.0->pandas-datareader)
Requirement already satisfied: numpy>=1.7.0 in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from pandas>=0.17.0->pandas-datareader)
Requirement already satisfied: six in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from requests-file->pandas-datareader)
Building wheels for collected packages: requests-file, requests-ftp
  Running setup.py bdist_wheel for requests-file: started
  Running setup.py bdist_wheel for requests-file: finished with status 'done'
  Stored in directory: C:\Users\071581.TOR.000\AppData\Local\pip\Cache\wheels\3e\34\3a\c2e634ca7b545510c1b3b7d94dea084e5fdb5f33558f3c3a81
  Running setup.py bdist_wheel for requests-ftp: started
  Running setup.py bdist_wheel for requests-ftp: finished with status 'done'
  Stored in directory: C:\Users\071581.TOR.000\AppData\Local\pip\Cache\wheels\76\fb\0d\1026eb562c34a4982dc9d39c9c582a734eefe7f0455f711deb
Successfully built requests-file requests-ftp
Installing collected packages: requests-file, requests-ftp, pandas-datareader
Successfully installed pandas-datareader-0.5.0 requests-file-1.4.2 requests-ftp-0.3.1
In [2]:
from pandas_datareader import data, wb
import pandas as pd
import numpy as np
import datetime
%matplotlib inline

Data

We need to get data using pandas datareader. We will get stock information for the following banks:

  • Bank of America
  • CitiGroup
  • Goldman Sachs
  • JPMorgan Chase
  • Morgan Stanley
  • Wells Fargo

Figure out how to get the stock data from Jan 1st 2006 to Jan 1st 2016 for each of these banks. Set each bank to be a separate dataframe, with the variable name for that bank being its ticker symbol. This will involve a few steps:

  1. Use datetime to set start and end datetime objects.
  2. Figure out the ticker symbol for each bank.
  3. Figure out how to use datareader to grab info on the stock.

Use this documentation page for hints and instructions (it should just be a matter of replacing certain values. Use google finance as a source, for example:

# Bank of America
BAC = data.DataReader("BAC", 'google', start, end)
In [3]:
startd=datetime.date(2006,1,1)

endd=datetime.date(2016,1,1)

BAC = data.DataReader("BAC", 'yahoo', startd,  endd)

#BAC.loc[datetime.date(2006,1,1)]
In [4]:
BAC.head()
Out[4]:
Open High Low Close Adj Close Volume
Date
2006-01-03 46.919998 47.180000 46.150002 47.080002 37.157749 16296700
2006-01-04 47.000000 47.240002 46.450001 46.580002 36.763119 17757900
2006-01-05 46.580002 46.830002 46.320000 46.639999 36.810474 14970700
2006-01-06 46.799999 46.910000 46.349998 46.570000 36.755238 12599800
2006-01-09 46.720001 46.970001 46.360001 46.599998 36.778908 15619400
In [5]:
C = data.DataReader("C", 'yahoo', '2006-01-01', '2016-01-01')
In [6]:
C.head()
Out[6]:
Open High Low Close Adj Close Volume
Date
2006-01-03 490.000000 493.799988 481.100006 492.899994 415.046356 1537600
2006-01-04 488.600006 491.000000 483.500000 483.799988 407.383698 1870900
2006-01-05 484.399994 487.799988 484.000000 486.200012 409.404694 1143100
2006-01-06 488.799988 489.000000 482.000000 486.200012 409.404694 1370200
2006-01-09 486.000000 487.399994 483.000000 483.899994 407.467926 1680700
In [7]:
GS = data.DataReader("GS", 'yahoo', '2006-01-01', '2016-01-01')
In [8]:
GS.head()
Out[8]:
Open High Low Close Adj Close Volume
Date
2006-01-03 126.699997 129.440002 124.230003 128.869995 112.337547 6188700
2006-01-04 127.349998 128.910004 126.379997 127.089996 110.785889 4861600
2006-01-05 126.000000 127.320000 125.610001 127.040001 110.742340 3717400
2006-01-06 127.290001 129.250000 127.290001 128.839996 112.311401 4319600
2006-01-09 128.500000 130.619995 128.000000 130.389999 113.662605 4723500
In [9]:
JPM = data.DataReader("JPM", 'yahoo', '2006-01-01', '2016-01-01')
In [10]:
JPM.head()
Out[10]:
Open High Low Close Adj Close Volume
Date
2006-01-03 39.830002 40.360001 39.299999 40.189999 29.699856 12838600
2006-01-04 39.779999 40.139999 39.419998 39.619999 29.528452 13491500
2006-01-05 39.610001 39.810001 39.500000 39.740002 29.617882 8109400
2006-01-06 39.919998 40.240002 39.549999 40.020000 29.826565 7966900
2006-01-09 39.880001 40.720001 39.880001 40.669998 30.310993 16575200
In [11]:
MS = data.DataReader("MS", 'yahoo', '2006-01-01', '2016-01-01')
In [12]:
MS.head()
Out[12]:
Open High Low Close Adj Close Volume
Date
2006-01-03 57.169998 58.490002 56.740002 58.310001 40.062382 5377000
2006-01-04 58.700001 59.279999 58.349998 58.349998 40.089870 7977800
2006-01-05 58.549999 58.590000 58.020000 58.509998 40.199799 5778000
2006-01-06 58.770000 58.849998 58.049999 58.570000 40.241035 6889800
2006-01-09 58.630001 59.290001 58.619999 59.189999 40.666996 4144500
In [13]:
WFC = data.DataReader("WFC", 'yahoo', '2006-01-01', '2016-01-01')
In [14]:
WFC.head()
Out[14]:
Open High Low Close Adj Close Volume
Date
2006-01-03 31.600000 31.975000 31.195000 31.900000 23.077190 11016400
2006-01-04 31.799999 31.820000 31.365000 31.530001 22.809519 10870000
2006-01-05 31.500000 31.555000 31.309999 31.495001 22.784204 10158000
2006-01-06 31.580000 31.775000 31.385000 31.680000 22.918036 8403800
2006-01-09 31.674999 31.825001 31.555000 31.674999 22.914423 5619600

Create a list of the ticker symbols (as strings) in alphabetical order. Call this list: tickers

In [15]:
tickers = ['BAC','C','GS','JPM','MS','WFC']
tickers
Out[15]:
['BAC', 'C', 'GS', 'JPM', 'MS', 'WFC']

Use pd.concat to concatenate the bank dataframes together to a single data frame called bank_stocks. Set the keys argument equal to the tickers list. Also pay attention to what axis you concatenate on.

In [16]:
bank_stocks = pd.concat([BAC, C, GS, JPM, MS, WFC],axis=1, keys = tickers)

Set the column name levels (this is filled out for you):

In [17]:
bank_stocks.columns.names = ['Bank Ticker','Stock Info']

Check the head of the bank_stocks dataframe.

In [18]:
bank_stocks.head()
Out[18]:
Bank Ticker BAC C ... MS WFC
Stock Info Open High Low Close Adj Close Volume Open High Low Close ... Low Close Adj Close Volume Open High Low Close Adj Close Volume
Date
2006-01-03 46.919998 47.180000 46.150002 47.080002 37.157749 16296700 490.000000 493.799988 481.100006 492.899994 ... 56.740002 58.310001 40.062382 5377000 31.600000 31.975000 31.195000 31.900000 23.077190 11016400
2006-01-04 47.000000 47.240002 46.450001 46.580002 36.763119 17757900 488.600006 491.000000 483.500000 483.799988 ... 58.349998 58.349998 40.089870 7977800 31.799999 31.820000 31.365000 31.530001 22.809519 10870000
2006-01-05 46.580002 46.830002 46.320000 46.639999 36.810474 14970700 484.399994 487.799988 484.000000 486.200012 ... 58.020000 58.509998 40.199799 5778000 31.500000 31.555000 31.309999 31.495001 22.784204 10158000
2006-01-06 46.799999 46.910000 46.349998 46.570000 36.755238 12599800 488.799988 489.000000 482.000000 486.200012 ... 58.049999 58.570000 40.241035 6889800 31.580000 31.775000 31.385000 31.680000 22.918036 8403800
2006-01-09 46.720001 46.970001 46.360001 46.599998 36.778908 15619400 486.000000 487.399994 483.000000 483.899994 ... 58.619999 59.189999 40.666996 4144500 31.674999 31.825001 31.555000 31.674999 22.914423 5619600

5 rows × 36 columns

EDA

Let's explore the data a bit! Before continuing, I encourage you to check out the documentation on Multi-Level Indexing and Using .xs. Reference the solutions if you can not figure out how to use .xs(), since that will be a major part of this project.

What is the max Close price for each bank's stock throughout the time period?

In [19]:
bank_stocks.xs('Close', axis=1, level = 1).max()
Out[19]:
Bank Ticker
BAC     54.900002
C      564.099976
GS     247.919998
JPM     70.080002
MS      89.300003
WFC     58.520000
dtype: float64

Create a new empty DataFrame called returns. This dataframe will contain the returns for each bank's stock. returns are typically defined by:

$$r_t = \frac{p_t - p_{t-1}}{p_{t-1}} = \frac{p_t}{p_{t-1}} - 1$$
In [20]:
BAC_R =bank_stocks['BAC']['Close'].pct_change()
C_R =bank_stocks['C']['Close'].pct_change()
GS_R =bank_stocks['GS']['Close'].pct_change()
JPM_R =bank_stocks['JPM']['Close'].pct_change()
MS_R =bank_stocks['MS']['Close'].pct_change()
WFC_R =bank_stocks['WFC']['Close'].pct_change()
tickersR = ['BAC Return','C Return','GS Return','JPM Return','MS Return','WFC Return']
returns = pd.concat([BAC_R, C_R, GS_R, JPM_R, MS_R, WFC_R],axis=1, keys = tickersR)
returns.head()
Out[20]:
BAC Return C Return GS Return JPM Return MS Return WFC Return
Date
2006-01-03 NaN NaN NaN NaN NaN NaN
2006-01-04 -0.010620 -0.018462 -0.013812 -0.014183 0.000686 -0.011599
2006-01-05 0.001288 0.004961 -0.000393 0.003029 0.002742 -0.001110
2006-01-06 -0.001501 0.000000 0.014169 0.007046 0.001025 0.005874
2006-01-09 0.000644 -0.004731 0.012030 0.016242 0.010586 -0.000158

We can use pandas pct_change() method on the Close column to create a column representing this return value. Create a for loop that goes and for each Bank Stock Ticker creates this returns column and set's it as a column in the returns DataFrame.

Create a pairplot using seaborn of the returns dataframe. What stock stands out to you? Can you figure out why?

In [21]:
import seaborn as sns
sns.pairplot(returns.dropna())
Out[21]:
<seaborn.axisgrid.PairGrid at 0xac902c0e48>
  • See solution for details about Citigroup behavior....

Using this returns DataFrame, figure out on what dates each bank stock had the best and worst single day returns. You should notice that 4 of the banks share the same day for the worst drop, did anything significant happen that day?

In [22]:
returns.idxmin()
Out[22]:
BAC Return   2009-01-20
C Return     2009-02-27
GS Return    2009-01-20
JPM Return   2009-01-20
MS Return    2008-10-09
WFC Return   2009-01-20
dtype: datetime64[ns]
In [23]:
returns.idxmax()
Out[23]:
BAC Return   2009-04-09
C Return     2008-11-24
GS Return    2008-11-24
JPM Return   2009-01-21
MS Return    2008-10-13
WFC Return   2008-07-16
dtype: datetime64[ns]

You should have noticed that Citigroup's largest drop and biggest gain were very close to one another, did anythign significant happen in that time frame?

  • See Solution for details
In [24]:
returns.std()
Out[24]:
BAC Return    0.036647
C Return      0.038672
GS Return     0.025390
JPM Return    0.027667
MS Return     0.037819
WFC Return    0.030238
dtype: float64
In [25]:
returns.std().idxmax()
Out[25]:
'C Return'
In [26]:
returns[returns.index >'2014-12-31'].std()
Out[26]:
BAC Return    0.016163
C Return      0.015289
GS Return     0.014046
JPM Return    0.014017
MS Return     0.016249
WFC Return    0.012591
dtype: float64

Take a look at the standard deviation of the returns, which stock would you classify as the riskiest over the entire time period? Which would you classify as the riskiest for the year 2015?

In [27]:
returns[returns.index >'2014-12-31'].std().idxmax()
Out[27]:
'MS Return'

Create a distplot using seaborn of the 2015 returns for Morgan Stanley

In [28]:
sns.distplot(returns[returns.index >'2014-12-31']['MS Return'].dropna(),bins=120);
C:\Users\071581.TOR.000\AppData\Local\Continuum\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j

Create a distplot using seaborn of the 2008 returns for CitiGroup

In [29]:
sns.distplot(returns[(returns.index >'2007-12-31') & (returns.index <='2008-12-31')]['C Return'].dropna(),bins=120);
C:\Users\071581.TOR.000\AppData\Local\Continuum\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j

More Visualization

A lot of this project will focus on visualizations. Feel free to use any of your preferred visualization libraries to try to recreate the described plots below, seaborn, matplotlib, plotly and cufflinks, or just pandas.

Imports

In [30]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
In [33]:
!pip install plotly
Collecting plotly
  Downloading plotly-2.2.1.tar.gz (1.1MB)
Requirement already satisfied: decorator>=4.0.6 in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from plotly)
Requirement already satisfied: nbformat>=4.2 in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from plotly)
Requirement already satisfied: pytz in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from plotly)
Requirement already satisfied: requests in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from plotly)
Requirement already satisfied: six in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from plotly)
Building wheels for collected packages: plotly
  Running setup.py bdist_wheel for plotly: started
  Running setup.py bdist_wheel for plotly: finished with status 'done'
  Stored in directory: C:\Users\071581.TOR.000\AppData\Local\pip\Cache\wheels\cc\87\3f\6a282eb21da5d8223472bed40ee023cdcf2e9691b117969a4c
Successfully built plotly
Installing collected packages: plotly
Successfully installed plotly-2.2.1
In [34]:
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

print(__version__) # requires version >= 1.9.0
2.2.1
In [36]:
!pip install cufflinks
import cufflinks as cf
Collecting cufflinks
  Downloading cufflinks-0.12.1.tar.gz (49kB)
Requirement already satisfied: pandas in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from cufflinks)
Requirement already satisfied: plotly>=2.0.0 in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from cufflinks)
Collecting colorlover>=0.2 (from cufflinks)
  Downloading colorlover-0.2.1.tar.gz
Requirement already satisfied: python-dateutil>=2 in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from pandas->cufflinks)
Requirement already satisfied: pytz>=2011k in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from pandas->cufflinks)
Requirement already satisfied: numpy>=1.7.0 in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from pandas->cufflinks)
Requirement already satisfied: requests in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from plotly>=2.0.0->cufflinks)
Requirement already satisfied: nbformat>=4.2 in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from plotly>=2.0.0->cufflinks)
Requirement already satisfied: six in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from plotly>=2.0.0->cufflinks)
Requirement already satisfied: decorator>=4.0.6 in c:\users\071581.tor.000\appdata\local\continuum\anaconda3\lib\site-packages (from plotly>=2.0.0->cufflinks)
Building wheels for collected packages: cufflinks, colorlover
  Running setup.py bdist_wheel for cufflinks: started
  Running setup.py bdist_wheel for cufflinks: finished with status 'done'
  Stored in directory: C:\Users\071581.TOR.000\AppData\Local\pip\Cache\wheels\9d\31\82\40faec2196f9c615b235219f4ae076958c544fa525a4155f42
  Running setup.py bdist_wheel for colorlover: started
  Running setup.py bdist_wheel for colorlover: finished with status 'done'
  Stored in directory: C:\Users\071581.TOR.000\AppData\Local\pip\Cache\wheels\b8\b0\18\76d3f3088cc73950ed1aa44ea074b93ed37309fea57dec78dc
Successfully built cufflinks colorlover
Installing collected packages: colorlover, cufflinks
Successfully installed colorlover-0.2.1 cufflinks-0.12.1
In [37]:
#cf.go_offline()

init_notebook_mode(connected=True)
def configure_plotly_browser_state():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-1.5.1.min.js?noext',
            },
          });
        </script>
        '''))
In [38]:
cf.go_offline()

Create a line plot showing Close price for each bank for the entire index of time. (Hint: Try using a for loop, or use .xs to get a cross section of the data.)

In [41]:
#configure_plotly_browser_state()
init_notebook_mode(connected=False)
bank_stocks.xs('Close', axis=1, level = 1).iplot()
In [0]:
 
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d1aea58>

Moving Averages

Let's analyze the moving averages for these stocks in the year 2008.

Plot the rolling 30 day average against the Close Price for Bank Of America's stock for the year 2008

In [42]:
#configure_plotly_browser_state()
init_notebook_mode(connected=False)

bank_stocks.xs('Close', axis=1, level = 1).rolling(window=30).mean().dropna().iplot()
In [43]:
#configure_plotly_browser_state()
init_notebook_mode(connected=False)
closePrice = bank_stocks.xs('Close', axis=1, level = 1)['BAC']
closePrice2008 = closePrice[(closePrice.index > '2008-01-01') & (closePrice.index <'2009-01-01')]
closePriceMA2008 = closePrice2008.rolling(window=30).mean().dropna()

pd.concat([closePrice2008, closePriceMA2008],axis=1, keys = ['BAC Closing', 'BAC Closing 30d Average']).iplot()

#closePriceMA = bank_stocks.xs('Close', axis=1, level = 1).rolling(window=30).mean().dropna()['BAC']
#closePriceMA2008 = closePriceMA[(closePriceMA.index > '2008-01-01') & (closePriceMA.index <'2009-01-01')].iplot()

Create a heatmap of the correlation between the stocks Close Price.

In [44]:
sns.heatmap(bank_stocks.xs('Close', axis=1, level = 1).corr(),cmap='coolwarm',annot=True)
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0xac962c1da0>
In [0]:
 
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ec0ceb8>
In [0]:
 

Optional: Use seaborn's clustermap to cluster the correlations together:

In [45]:
sns.clustermap(bank_stocks.xs('Close', axis=1, level = 1).corr(),cmap='coolwarm',annot=True)
C:\Users\071581.TOR.000\AppData\Local\Continuum\Anaconda3\lib\site-packages\matplotlib\cbook.py:136: MatplotlibDeprecationWarning:

The axisbg attribute was deprecated in version 2.0. Use facecolor instead.

Out[45]:
<seaborn.matrix.ClusterGrid at 0xac968a5860>
In [0]:
 
Out[0]:
<seaborn.matrix.ClusterGrid at 0x11d4ba9b0>

Part 2 (Optional)

In this second part of the project we will rely on the cufflinks library to create some Technical Analysis plots. This part of the project is experimental due to its heavy reliance on the cuffinks project, so feel free to skip it if any functionality is broken in the future.

Use .iplot(kind='candle) to create a candle plot of Bank of America's stock from Jan 1st 2015 to Jan 1st 2016.

In [50]:
#configure_plotly_browser_state()
init_notebook_mode(connected=False)
BAC[(BAC.index >='2015-01-01') & (BAC.index <='2015-12-31')].iplot(kind='candle',theme='ggplot' )

Use .ta_plot(study='sma') to create a Simple Moving Averages plot of Morgan Stanley for the year 2015.

In [54]:
#configure_plotly_browser_state()
init_notebook_mode(connected=False)
MS[(MS.index >='2015-01-01') & (MS.index <='2015-12-31')]['Close'].ta_plot(study='sma') 

Use .ta_plot(study='boll') to create a Bollinger Band Plot for Bank of America for the year 2015.

In [57]:
#configure_plotly_browser_state()
init_notebook_mode(connected=False)
BAC[(BAC.index >='2015-01-01') & (BAC.index <='2015-12-31')]['Close'].ta_plot(study='boll')

Great Job!

Definitely a lot of more specific finance topics here, so don't worry if you didn't understand them all! The only thing you should be concerned with understanding are the basic pandas and visualization oeprations.